{"nbformat":4,"nbformat_minor":0,"metadata":{"anaconda-cloud":{},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.5.2"},"colab":{"name":"Tutorial VI_tf2.ipynb","provenance":[],"collapsed_sections":["SGyY2JPXtsq7","SH0SSbAftsq1","w7FnR5Kwtsq9","TvQjWvbWtsrH","HHSCevnwtsrQ","I3ojvV71pHJ6","4aqxQ_X-tsrV","7BldvImFYfm4","PCKy-0EotsrX","y7-p8ClctsrY"],"toc_visible":true},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","metadata":{"id":"Lbpit0_rtspv","colab_type":"text"},"source":["# Tutorial VI: Recurrent Neural Networks"]},{"cell_type":"markdown","metadata":{"id":"C-Fr8e3Ltspx","colab_type":"text"},"source":[
"Bern Winter School on Machine Learning, 27-31 January 2020\n","Prepared by Mykhailo Vladymyrov.\n","\n",
"This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License."]},{"cell_type":"markdown","metadata":{"id":"S0Upuaj8tspy","colab_type":"text"},"source":["In this session we will see what an RNN is. We will use it to predict/generate a text sequence, but the same approach can be applied to any sequential data.\n"]},{"cell_type":"markdown","metadata":{"id":"7QUO3V3Stspz","colab_type":"text"},"source":["So far we have looked at data that is available all at once. In many cases, however, the data is sequential (weather, speech, sensor signals, etc.).\n","RNNs are specifically designed for such tasks."
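,"\n","At each step a recurrent network keeps a hidden state $h_t$ that summarizes the sequence seen so far, and updates it from the current input $x_t$. For a vanilla RNN cell the update is\n","\n","$$h_t = \\tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \\qquad y_t = W_{hy} h_t + b_y$$\n","\n","The LSTM cells used below add gating to this basic recurrence, which helps preserve information over long ranges."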
]},{"cell_type":"markdown","metadata":{"id":"SGyY2JPXtsq7","colab_type":"text"},"source":["## 1. Load necessary libraries"]},{"cell_type":"code","metadata":{"id":"9jKCF9MpKWdF","colab_type":"code","colab":{}},"source":["# if using Google Colab\n","%tensorflow_version 2.x"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"DpRETZFNtsq7","colab_type":"code","colab":{}},"source":["import sys\n","\n","import numpy as np\n","import matplotlib.pyplot as plt\n","import IPython.display as ipyd\n","import tensorflow as tf\n","import collections\n","import time\n","\n","# We'll tell matplotlib to inline any drawn figures like so:\n","%matplotlib inline\n","plt.style.use('ggplot')\n","\n","# let the GPU memory grow as needed (skipped when no GPU is available)\n","physical_devices = tf.config.experimental.list_physical_devices('GPU')\n","if physical_devices:\n","    tf.config.experimental.set_memory_growth(physical_devices[0], True)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"SH0SSbAftsq1","colab_type":"text"},"source":["## Unpack the materials\n","If using Colab, run the next cell."]},{"cell_type":"code","metadata":{"id":"Grv04xmitsq2","colab_type":"code","colab":{}},"source":["p = tf.keras.utils.get_file('./material.tgz', 'https://scits-training.unibe.ch/data/tut_files/material.tgz')\n","!mv {p} .\n","!tar -xvzf material.tgz > /dev/null 2>&1"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"rLWt72gnKj4M","colab_type":"code","colab":{}},"source":["from utils import gr_disp"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"w7FnR5Kwtsq9","colab_type":"text"},"source":["## 2. Load the text data"]},{"cell_type":"code","metadata":{"id":"N3cmvKeatsq-","colab_type":"code","colab":{}},"source":["def read_data(fname):\n","    # read the file and split it into a flat array of words\n","    with open(fname) as f:\n","        content = f.readlines()\n","    content = [x.strip() for x in content]\n","    content = [word for line in content for word in line.split()]\n","    content = np.array(content)\n","    return content"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"m4KJQxKqtsrA","colab_type":"code","colab":{}},"source":["training_file = 'RNN/rnn.txt'"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"mMumjsH8tsrD","colab_type":"code","colab":{}},"source":["training_data = read_data(training_file)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"Q_GMc64ptsrF","colab_type":"code","colab":{}},"source":["print(training_data[:100])"],"execution_count":0,"outputs":[]},
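{"cell_type":"markdown","metadata":{"id":"corpus-stats-md","colab_type":"text"},"source":["Let's take a quick look at the corpus size and the number of distinct words it contains (a small sanity check on the loaded data):"]},{"cell_type":"code","metadata":{"id":"corpus-stats-code","colab_type":"code","colab":{}},"source":["# quick check: corpus size and number of distinct words\n","print('total words: ', len(training_data))\n","print('unique words:', len(set(training_data)))"],"execution_count":0,"outputs":[]},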
{"cell_type":"markdown","metadata":{"id":"TvQjWvbWtsrH","colab_type":"text"},"source":["## 3. Build dataset\n","We will assign an id to each word, and build dictionaries word->id and id->word.\n","The most frequent words get the lowest ids."]},{"cell_type":"code","metadata":{"id":"CqWfeze4tsrI","colab_type":"code","colab":{}},"source":["def build_dataset(words):\n","    # count word occurrences; most_common() returns words sorted by frequency\n","    count = collections.Counter(words).most_common()\n","    dictionary = {}\n","    for word, _ in count:\n","        dictionary[word] = len(dictionary)\n","    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n","    return dictionary, reverse_dictionary"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"UxjTb5VUtsrK","colab_type":"code","colab":{}},"source":["dictionary, reverse_dictionary = build_dataset(training_data)\n","vocab_size = len(dictionary)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"HJt3d4lJtsrL","colab_type":"code","colab":{}},"source":["print(dictionary)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"e1jtdvAztsrN","colab_type":"text"},"source":["The whole text then becomes a sequence of word ids:"]},{"cell_type":"code","metadata":{"id":"sNVe-0P_tsrO","colab_type":"code","colab":{}},"source":["words_as_int = [dictionary[w] for w in training_data]\n","print(words_as_int)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"HHSCevnwtsrQ","colab_type":"text"},"source":["## 4. Build model"]},{"cell_type":"markdown","metadata":{"id":"phW3ru-dUSyv","colab_type":"text"},"source":["We will build the model in TF2.\n","It will contain an embedding layer, three LSTM layers, and a dense layer on top that outputs the probability of the next word.\n","All LSTM layers return full sequences, so the model predicts the next word at every position of the input:"]},{"cell_type":"code","metadata":{"id":"S3Y8GDDqjvyx","colab_type":"code","colab":{}},"source":["# Parameters\n","n_input = 3  # length of the word sequence used to predict the following word\n","\n","# numbers of units in the LSTM layers\n","n_hidden = [256, 512, 128]\n","\n","model = tf.keras.Sequential()\n","model.add(tf.keras.layers.Embedding(vocab_size, 128, input_length=n_input))\n","\n","for n_h in n_hidden:\n","    model.add(tf.keras.layers.LSTM(n_h, return_sequences=True, name='lstm%d' % n_h))\n","\n","model.add(tf.keras.layers.Dense(vocab_size, activation='softmax'))\n","\n","model.compile(optimizer='rmsprop',\n","              loss='sparse_categorical_crossentropy',\n","              metrics=['accuracy'])\n","\n","W0 = model.get_weights()  # save the initial weights, so the model can be reset later\n","model.summary()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"I3ojvV71pHJ6","colab_type":"text"},"source":[
"## 5. Data streaming"]},{"cell_type":"markdown","metadata":{"id":"z_EYsL_TVaBz","colab_type":"text"},"source":["Here we will see how to build a `tf.data` pipeline to feed the model during training:"]},{"cell_type":"code","metadata":{"id":"muEoZW05nFQt","colab_type":"code","colab":{}},"source":["# create a tf.data.Dataset object\n","word_dataset = tf.data.Dataset.from_tensor_slices(words_as_int)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"kfSyUC6CnlZ8","colab_type":"code","colab":{}},"source":["# the take method generates elements:\n","for i in word_dataset.take(5):\n","    print(reverse_dictionary[i.numpy()])"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"SBK7FWPFVwSP","colab_type":"text"},"source":["The `batch` method creates a dataset that generates sequences of elements:"]},{"cell_type":"code","metadata":{"id":"jtGgHXkPnc7P","colab_type":"code","colab":{}},"source":["sequences = word_dataset.batch(n_input + 1, drop_remainder=True)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"bHOcJ_Wgn3n_","colab_type":"code","colab":{}},"source":["# helper for int-to-text conversion\n","to_text = lambda arr: ' '.join(reverse_dictionary[it] for it in arr)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"v0DYh6wNnuJr","colab_type":"code","colab":{}},"source":["for item in sequences.take(5):\n","    print(to_text(item.numpy()))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Pe6o55hyWnyX","colab_type":"text"},"source":["The `map` method lets us apply any function to preprocess the data:"]},{"cell_type":"code","metadata":{"id":"7mFJ4jNloJk7","colab_type":"code","colab":{}},"source":["def split_input_target(chunk):\n","    # for a chunk of n_input+1 words, the input is the first n_input words\n","    # and the target is the same sequence shifted by one word\n","    input_text = chunk[:-1]\n","    target_text = chunk[1:]\n","\n","    return input_text, target_text\n","\n","dataset = sequences.map(split_input_target)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"bbhGQ5k_XE5f","colab_type":"text"},"source":["The model will predict `input_text` -> `target_text`:"]},{"cell_type":"code","metadata":{"id":"6erRolZ-omdr","colab_type":"code","colab":{}},"source":["for input_example, target_example in dataset.take(1):\n","    print('Input data: ', to_text(input_example.numpy()))\n","    print('Target data:', to_text(target_example.numpy()))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"74eDk9K_XXQ0","colab_type":"text"},"source":["Finally, we shuffle the items and produce minibatches of 16 elements:"]},{"cell_type":"code","metadata":{"id":"LcDexYkspBa9","colab_type":"code","colab":{}},"source":["dataset = dataset.shuffle(10000).batch(16, drop_remainder=True)\n","dataset"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"GX7Fdz3PXzxY","colab_type":"text"},"source":["Let's test the untrained model:"]},{"cell_type":"code","metadata":{"id":"QNa3Ysj3qOEh","colab_type":"code","colab":{}},"source":["for input_example_batch, target_example_batch in dataset.take(1):\n","    example_batch_predictions = model(input_example_batch)\n","    print(example_batch_predictions.shape, \"# (batch_size, sequence_length, vocab_size)\")"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"aV13OaTQqcVU","colab_type":"code","colab":{}},"source":["print('input: ', to_text(input_example_batch.numpy()[0]))\n","print('target:', to_text(target_example_batch.numpy()[0]))\n","print('pred:  ', to_text(example_batch_predictions.numpy()[0].argmax(axis=1)))\n"],"execution_count":0,"outputs":[]},
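{"cell_type":"markdown","metadata":{"id":"untrained-loss-md","colab_type":"text"},"source":["The untrained model outputs nearly uniform probabilities over the vocabulary, so its cross-entropy loss should be close to `ln(vocab_size)`. A quick sanity check (a minimal sketch, reusing the example batch from above):"]},{"cell_type":"code","metadata":{"id":"untrained-loss-code","colab_type":"code","colab":{}},"source":["# sanity check: loss of the untrained model vs. the uniform-prediction baseline\n","example_batch_loss = tf.keras.losses.sparse_categorical_crossentropy(\n","    target_example_batch, example_batch_predictions)\n","print('mean loss:     ', example_batch_loss.numpy().mean())\n","print('ln(vocab_size):', np.log(vocab_size))"],"execution_count":0,"outputs":[]},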
{"cell_type":"markdown","metadata":{"id":"4aqxQ_X-tsrV","colab_type":"text"},"source":["## 6. Train!"]},{"cell_type":"code","metadata":{"id":"Q7wT782Hrq-R","colab_type":"code","colab":{}},"source":["# uncomment to reset the model to its initial state:\n","# model.set_weights(W0)\n","history = model.fit(dataset, epochs=200, verbose=1)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"3Dan-chKsvbU","colab_type":"code","colab":{}},"source":["def draw_history(hist):\n","    fig, axs = plt.subplots(1, 2, figsize=(10, 5))\n","    axs[0].plot(hist.epoch, hist.history['loss'])\n","    if 'val_loss' in hist.history:\n","        axs[0].plot(hist.epoch, hist.history['val_loss'])\n","    axs[0].legend(('training loss', 'validation loss'))\n","    axs[1].plot(hist.epoch, hist.history['accuracy'])\n","    if 'val_accuracy' in hist.history:\n","        axs[1].plot(hist.epoch, hist.history['val_accuracy'])\n","\n","    axs[1].legend(('training accuracy', 'validation accuracy'))\n","    plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"iAJu4rMys3da","colab_type":"code","colab":{}},"source":["draw_history(history)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"7BldvImFYfm4","colab_type":"text"},"source":["## 7. Generating text with the RNN"]},{"cell_type":"markdown","metadata":{"id":"b3kuXO1jYu-a","colab_type":"text"},"source":["Take a word sequence and generate the following 128 words:"]},{"cell_type":"code","metadata":{"id":"uyTL_hrbxwj-","colab_type":"code","colab":{}},"source":["def gen_long(model, word_id_arr, n_words=128):\n","    out = []\n","    words = list(word_id_arr.copy())\n","    for _ in range(n_words):\n","        keys = np.reshape(np.array(words), [-1, n_input])\n","\n","        probs = model(keys).numpy()[0]  # (n_input, vocab_size) next-word probabilities\n","        pred_index = probs.argmax(axis=1)\n","        pred = pred_index[-1]           # most likely word after the last position\n","        out.append(pred)\n","\n","        words = words[1:]\n","        words.append(pred)\n","    sentence = to_text(out)\n","    return sentence"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"R7gzaQ1FyRCE","colab_type":"code","colab":{}},"source":["for input_example_batch, target_example_batch in dataset.take(10):\n","    input_seq = input_example_batch.numpy()[0]\n","    sentence = gen_long(model, input_seq)\n","    print(to_text(input_seq), '...')\n","    print('\\t...', sentence, '\\n')"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"wGv_d3bzZB9a","colab_type":"text"},"source":["Or try entering some text and see the continuation:"]},{"cell_type":"code","metadata":{"code_folding":[],"id":"_iX7hcrFtsrW","colab_type":"code","colab":{}},"source":["while True:\n","    prompt = \"%s words: \" % n_input\n","\n","    try:\n","        sentence = input(prompt)\n","    except KeyboardInterrupt:\n","        break\n","\n","    sentence = sentence.strip()\n","    words = sentence.split(' ')\n","    if len(words) != n_input:\n","        continue\n","    try:\n","        symbols_in_keys = [dictionary[w] for w in words]\n","    except KeyError:\n","        print(\"Word not in dictionary\")\n","        continue\n","\n","    sentence = gen_long(model, symbols_in_keys)\n","    print(sentence)\n"],"execution_count":0,"outputs":[]},
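{"cell_type":"markdown","metadata":{"id":"sampled-gen-md","colab_type":"text"},"source":["Greedy argmax generation is deterministic and can get stuck repeating the same phrase. A common alternative is to sample the next word from the predicted distribution; below is a minimal sketch, mirroring `gen_long`, with a temperature parameter that controls the randomness:"]},{"cell_type":"code","metadata":{"id":"sampled-gen-code","colab_type":"code","colab":{}},"source":["# sketch: sample the next word instead of taking the argmax\n","# temperature < 1 -> sharper distribution (more conservative),\n","# temperature > 1 -> flatter distribution (more random)\n","def gen_sampled(model, word_id_arr, n_words=128, temperature=0.8):\n","    out = []\n","    words = list(word_id_arr.copy())\n","    for _ in range(n_words):\n","        keys = np.reshape(np.array(words), [-1, n_input])\n","        probs = model(keys).numpy()[0][-1]           # distribution for the last position\n","        logits = np.log(probs + 1e-9) / temperature  # rescale in log space\n","        probs = np.exp(logits) / np.exp(logits).sum()\n","        pred = np.random.choice(len(probs), p=probs)\n","        out.append(pred)\n","        words = words[1:] + [pred]\n","    return to_text(out)\n","\n","for input_example_batch, _ in dataset.take(3):\n","    input_seq = input_example_batch.numpy()[0]\n","    print(to_text(input_seq), '->', gen_sampled(model, input_seq))"],"execution_count":0,"outputs":[]},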
{"cell_type":"markdown","metadata":{"id":"PCKy-0EotsrX","colab_type":"text"},"source":["## 8. Exercise\n"]},{"cell_type":"markdown","metadata":{"id":"OYXY5jfMAi6x","colab_type":"text"},"source":["* Run with 5-7 input words instead of 3.\n","* Increase the number of training epochs, since convergence will take much longer (and each epoch will take longer, too!)."]},{"cell_type":"markdown","metadata":{"id":"y7-p8ClctsrY","colab_type":"text"},"source":["## 9. Further reading"]},{"cell_type":"markdown","metadata":{"id":"mjhwEgdKtsrZ","colab_type":"text"},"source":["[Illustrated Guide to Recurrent Neural Networks](https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9)\n","\n","[Illustrated Guide to LSTM’s and GRU’s: A step by step explanation](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)"]}]}